1.5 billion words Arabic Corpus

Author

  • Ibrahim Abu El-Khair
Abstract

This study is an attempt to build a contemporary linguistic corpus for the Arabic language. The resulting text corpus includes more than five million newspaper articles. It contains over a billion and a half words in total, of which about three million are unique. The data were collected from newspaper articles in ten major news sources from eight Arabic countries over a period of fourteen years. The corpus was encoded in two encodings, UTF-8 and Windows CP-1256, and marked up in two markup languages, SGML and XML.
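To make the two encodings concrete, here is a minimal sketch of converting text between UTF-8 and Windows CP-1256 with Python's built-in codecs. The sample string and the `transcode` helper are illustrative assumptions, not part of the corpus or its tooling.

```python
# Hypothetical sample text; it is not taken from the corpus.
sample = "اللغة العربية"

# Encode the same text in both encodings named in the abstract.
utf8_bytes = sample.encode("utf-8")
cp1256_bytes = sample.encode("cp1256")

# CP-1256 is a single-byte encoding, so it uses one byte per character,
# while UTF-8 uses two bytes for each Arabic letter.
print(len(sample), len(utf8_bytes), len(cp1256_bytes))


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from the src encoding to the dst encoding (hypothetical helper)."""
    return data.decode(src).encode(dst)


# Converting the UTF-8 bytes to CP-1256 yields the directly encoded CP-1256 bytes.
converted = transcode(utf8_bytes, "utf-8", "cp1256")
print(converted == cp1256_bytes)
```

Characters outside the CP-1256 repertoire would raise `UnicodeEncodeError` in this sketch, which is one practical reason to distribute a UTF-8 version alongside a legacy one.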

Similar resources

Shamela: A Large-Scale Historical Arabic Corpus

Arabic is a widely-spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale, historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it ...

Word Embeddings and Convolutional Neural Network for Arabic Sentiment Classification

With the development and advancement of social networks, forums, blogs and online sales, a growing number of Arabs are expressing their opinions on the web. In this paper, a scheme of Arabic sentiment classification, which evaluates and detects the sentiment polarity of Arabic reviews and Arabic social media, is studied. We investigated several architectures to build a quality neural w...

A New Method for Named Entity Extraction in Classical Arabic

In Natural Language Processing (NLP) studies, developing resources and tools contributes to the extension and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers because of its significant impact on improving other NLP tasks such as machine translation, information retrieval, question answering, query result...

Words and Word Usage: Newspaper Text versus the Web

This paper explores the differences in words and word usage in two corpora, one derived from newspaper text and the other from the web. A corpus of web pages is compiled from a controlled traversal of the web, producing a topic-diverse collection of 2 billion words of web text. We compare this Web Corpus with the Gigaword Corpus, a 2 billion word collection of news articles. The Web Corpus is ...

The feasibility of a complete text corpus

This paper estimates the annual increase in size of a complete text corpus of a single language, Slovenian. The corpus comprises serial publications in Slovenian, monographs, and pages published on the Internet. The estimate for the year 2000, based on 21,000 units of serial publications, 675,000 pages from 5,200 units of printed monographs, 377,000 pages from 5,500 units of unpublished monograph...

Journal:
  • CoRR

Volume abs/1611.04033

Publication date: 2016